Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use curl_headers and curl_output for Livecheck strategies. #15351

Merged
merged 2 commits into from
May 8, 2023

Conversation

reitermarkus
Copy link
Member

@reitermarkus reitermarkus commented May 2, 2023

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

#15099 seems to be stalled, so I'm adding the curl_head usage here. Also switch to curl_output instead of directly calling curl_with_workarounds.

samford
samford previously requested changes May 2, 2023
Copy link
Member

@samford samford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #15117 (review), #curl_head is catered to the needs of CurlDownloadStrategy and isn't currently suitable for livecheck (namely, HeaderMatch).

For example, #curl_head will retry a request if a Content-Disposition header isn't present. This may be appropriate behavior for CurlDownloadStrategy but it doesn't make sense for livecheck, where we overwhelmingly check Location headers (and only a handful of those checks also provide a Content-Disposition header). Any retry behavior needs to be tied to the actual header(s) we're checking and #curl_head doesn't have a way of understanding that or controlling that behavior. As it stands, the hundreds of checks that check a Location header but don't contain responses with a Content-Disposition header would end up needlessly retrying with a GET request.

Besides that, the retry behavior is a mandatory part of #curl_head and there's currently no way to opt out or specify which request method to use (e.g., GET). There will be scenarios in the future where we specify the request method in a livecheck block (after the options feature is available) and that should effectively disable any automatic retry behavior. This would ensure that we make the correct request the first time, reducing the burden on upstream servers.

With that in mind, I'm of the opinion that the retry logic that's currently in #curl_head should be moved into CurlDownloadStrategy#resolve_url_basename_time_file_size, where it would check for a failed request or the lack of a Content-Disposition header and call #curl_head again with a GET request. This would make #curl_head a general-purpose method that could be used in livecheck. In this scenario, we would also add contextually-appropriate error-handling in livecheck, where #curl_head (or #page_headers) would be run again.

I'm not explicitly against using a general-purpose #curl_head method in Strategy#page_headers but, as mentioned, #curl_head would need some rework and we would still need some contextually-appropriate retry logic in livecheck (probably in HeaderMatch). However, I am against using #curl_head in livecheck as-is.

@reitermarkus
Copy link
Member Author

the hundreds of checks that check a Location header but don't contain responses with a Content-Disposition header would end up needlessly retrying with a GET request.

Currently they are all using GET, potentially needlessly if HEAD would suffice.

Also, I would imagine most Locations point to a file which should have a Content-Disposition header anyways.

@MikeMcQuaid
Copy link
Member

There will be scenarios in the future where we specify the request method in a livecheck block (after the options feature is available) and that should effectively disable any automatic retry behavior.

This should not block on potential future scenarios, that's not very fair on @reitermarkus.

With that in mind, I'm of the opinion that the retry logic that's currently in #curl_head should be moved into CurlDownloadStrategy#resolve_url_basename_time_file_size, where it would check for a failed request or the lack of a Content-Disposition header and call #curl_head again with a GET request. This would make #curl_head a general-purpose method that could be used in livecheck.

@samford am I right in saying this is the sole blocker, from your perspective, on this PR being merged? If so, does this seem reasonable to you @reitermarkus?

@samford
Copy link
Member

samford commented May 3, 2023

Currently they are all using GET, potentially needlessly if HEAD would suffice.

I agree that we should use HEAD for #page_header requests when possible but what I'm trying to explain is that #curl_head isn't capable of only using HEAD when it's sufficient for livecheck. Due to the Content-Disposition retry logic in #curl_head, this PR would make it so that 4 out of 5 HeaderMatch checks would issue HEAD requests before doing the same GET requests that we're currently doing. HEAD would work fine in almost all of the related cases (i.e., there aren't many that need a GET request) but #curl_head will unnecessarily retry with GET because there isn't a Content-Disposition header in any responses.

Also, I would imagine most Locations point to a file which should have a Content-Disposition header anyways.

As mentioned in #15117 (review), we have ~231 checks where only a Location header is present, ~27 where both a Location and Content-Disposition are present, and ~21 that only have Content-Disposition. In practical terms, ~83% of HeaderMatch checks would issue requests with HEAD, discard them, and retry with GET due to the Content-Disposition behavior.

This is worse than the status quo because it doubles the requests and we see no benefit from it. Keep in mind that ~88% of existing HeaderMatch checks contain a Location header and we follow the redirections, so we would go from 2+ requests to 4+ requests.


This should not block on potential future scenarios, that's not very fair on @reitermarkus.

Specifying the request method in a livecheck block isn't directly related to this PR, so I shouldn't have mentioned it. I only brought it up because the changes to #curl_head that would make this PR feasible are essentially the same as what we would need to support that use case.

Am I right in saying this is the sole blocker, from your perspective, on this PR being merged?

Basically yes but Markus and I may have differing views on what #curl_head should and shouldn't do.

I agree with the idea of using HEAD requests for Strategy#page_headers/HeaderMatch and I'm fine with using a general-purpose #curl_head method in #page_headers. However, I don't agree with using #curl_head if it includes retry logic. The conditions where it's appropriate to retry the request with GET are specific to the context where #curl_head is called but #curl_head has no knowledge of that context. Currently, it uses the conditions that are appropriate for CurlDownloadStrategy, which aren't appropriate for livecheck.

From my perspective, the #curl_head changes for this to be feasible are:

  • Remove the retry logic from #curl_head and handle it in CurlDownloadStrategy where #curl_head is called.
  • Add an optional parameter to #curl_head to make it use GET instead of HEAD. This is necessary to be able to move the retry logic outside of #curl_head.
  • Add appropriate retry logic to HeaderMatch, since it has the context we need to understand when to retry a request with GET (#page_headers doesn't have enough context). This is basically HeaderMatch: Retry with a GET request #15099 (maybe with some small tweaks).

@MikeMcQuaid
Copy link
Member

Am I right in saying this is the sole blocker, from your perspective, on this PR being merged?

Basically yes but Markus and I may have differing views on what #curl_head should and shouldn't do.

Ok, I'll take that as "yes" and @reitermarkus, assuming you're OK with that, please progress accordingly and we can ✅ and merge.

  • Add an optional parameter to #curl_head to make it use GET instead of HEAD. This is necessary to be able to move the retry logic outside of #curl_head.

This makes very little sense to me. The entire point of that method is to make a HEAD request, no?

@samford
Copy link
Member

samford commented May 3, 2023

This makes very little sense to me. The entire point of that method is to make a HEAD request, no?

The goal of the method is to fetch/return only the headers of the response(s). head in this case refers to the response head (i.e., the headers), as opposed to the response body (the content). It's easy to conflate it with the HEAD request method, which it uses by default.

The idea of an optional parameter to use GET is in line with curl's --head option, insomuch as it will use HEAD unless --request GET is also specified.

@MikeMcQuaid
Copy link
Member

The goal of the method is to fetch/return only the headers of the response(s). head in this case refers to the response head (i.e., the headers)

In that case: curl_headers feels more intuitive.

@samford
Copy link
Member

samford commented May 3, 2023

The goal of the method is to fetch/return only the headers of the response(s). head in this case refers to the response head (i.e., the headers)

In that case: curl_headers feels more intuitive.

I agree. Most people looking at the method are likely to think the same thing that you did and curl_headers would avoid that.

@samford
Copy link
Member

samford commented May 3, 2023

Also, just to be clear, we don't need to handle the livecheck retry logic in this PR. All that's needed here is to rework #curl_head as described above.

If we move forward with the #curl_head changes, I can rebase #15099 afterward, make a few tweaks, and we can move forward with that. The reason why I didn't rework it to use #curl_head when asked previously is because #curl_head isn't appropriate yet. I meant to explain it in that PR but was distracted with other things and forgot.

@reitermarkus reitermarkus enabled auto-merge May 6, 2023 01:42
@reitermarkus reitermarkus requested a review from samford May 6, 2023 01:42
@reitermarkus reitermarkus changed the title Use curl_head and curl_output for Livecheck strategies. Use curl_headers and curl_output for Livecheck strategies. May 6, 2023
@MikeMcQuaid
Copy link
Member

Looks good to me, thanks @reitermarkus!

@reitermarkus reitermarkus dismissed samford’s stale review May 8, 2023 22:55

Comments addressed.

@reitermarkus reitermarkus merged commit b05de92 into Homebrew:master May 8, 2023
@reitermarkus reitermarkus deleted the livecheck-head branch May 8, 2023 22:55
@bevanjkay
Copy link
Member

bevanjkay commented May 9, 2023

@reitermarkus I'm seeing an error on a number of livechecks undefined method success?' for nil:NilClass
It looks like the status is returning empty from the update to #curl_headers

Error: midiview: undefined method `success?' for nil:NilClass

@samford
Copy link
Member

samford commented May 9, 2023

I was coming to submit my review and explain the technical issues here (I just finished collecting data, testing, etc.) but apparently didn't make it in time. As-is, this breaks all the HeaderMatch checks and that would have been clear if even one HeaderMatch check was tested.

I'll create a follow-up PR with fixes and explanations (#15389).

@MikeMcQuaid
Copy link
Member

As-is, this breaks all the HeaderMatch checks and that would have been clear if even one HeaderMatch check was tested.

@samford That this doesn't happen automatically with any of our CI suites is the problem here, not that @reitermarkus missed this or that you didn't review it in time.

You need to figure out a way of automating things that doesn't rely on you catching things before they are merged. It's a reasonable assumption for Homebrew maintainers to be able to merge a PR if the code looks good and the CI looks good.

@github-actions github-actions bot added the outdated PR was locked due to age label Jun 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
outdated PR was locked due to age
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants